Exploiting Pretrained Biochemical Language Models for Targeted Drug Design
Motivation: The development of novel compounds targeting proteins of interest
is one of the most important tasks in the pharmaceutical industry. Deep
generative models have been applied to targeted molecular design and have shown
promising results. Recently, target-specific molecule generation has been
viewed as a translation between the protein language and the chemical language.
However, such a model is limited by the availability of interacting
protein-ligand pairs. On the other hand, large amounts of unlabeled protein
sequences and chemical compounds are available and have been used to train
language models that learn useful representations. In this study, we propose
exploiting pretrained biochemical language models to initialize (i.e., warm
start) targeted molecule generation models. We investigate two warm-start
strategies: (i) a one-stage strategy where the initialized model is trained on
targeted molecule generation, and (ii) a two-stage strategy comprising
pre-finetuning on molecular generation followed by target-specific training. We
also compare two decoding strategies to generate compounds: beam search and
sampling.
Results: The results show that the warm-started models perform better than a
baseline model trained from scratch. The two proposed warm-start strategies
achieve similar results to each other with respect to widely used metrics from
benchmarks. However, docking evaluation of the generated compounds for a number
of novel proteins suggests that the one-stage strategy generalizes better than
the two-stage strategy. Additionally, we observe that beam search outperforms
sampling in both docking evaluation and benchmark metrics for assessing
compound quality.
Availability and implementation: The source code is available at
https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials
are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145
Comment: 12 pages, to appear in Bioinformatics
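The two decoding strategies compared in the abstract above can be illustrated on a toy example. The following is a minimal sketch, not the authors' implementation: `TOY_MODEL` is a hypothetical next-token table standing in for a trained generator, and the function names `beam_search` and `sample_sequence` are made up for illustration. Beam search keeps the highest-scoring partial sequences at every step, while sampling draws each token at random from the model's distribution.

```python
import math
import random

# Hypothetical toy "language model": next-token probabilities given the last token.
# A real model would condition on the protein and on the whole prefix.
TOY_MODEL = {
    "<s>": {"C": 0.6, "N": 0.4},
    "C": {"O": 0.55, "N": 0.45},
    "N": {"O": 0.9, "</s>": 0.1},
    "O": {"</s>": 1.0},
}


def next_probs(seq):
    """Return the next-token distribution for a partial sequence."""
    return TOY_MODEL[seq[-1]]


def beam_search(beam_width=2, max_steps=5):
    """Keep the beam_width most probable hypotheses at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, token sequence)
    for _ in range(max_steps):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((logp, seq))  # finished hypotheses carry over
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams


def sample_sequence(rng, max_steps=5):
    """Draw one sequence token by token from the model's distribution."""
    seq = ["<s>"]
    for _ in range(max_steps):
        if seq[-1] == "</s>":
            break
        probs = next_probs(seq)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        seq.append(tok)
    return seq


best = beam_search(beam_width=2)[0]      # most probable sequence found by the beam
greedy = beam_search(beam_width=1)[0]    # beam width 1 reduces to greedy decoding
sampled = sample_sequence(random.Random(0))
```

In this toy table, a beam of width 2 recovers the sequence `N O` (probability 0.36), whereas greedy decoding commits to `C` first and ends with `C O` (probability 0.33). This only illustrates the mechanics of the two strategies; it says nothing about which one yields better compounds.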
Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties
Machine learning models have found numerous successful applications in
computational drug discovery. A large body of these models represents molecules
as sequences since molecular sequences are easily available, simple, and
informative. The sequence-based models often segment molecular sequences into
pieces called chemical words (analogous to the words that make up sentences in
human languages) and then apply advanced natural language processing techniques
for tasks such as drug design, property prediction, and
binding affinity prediction. However, the chemical characteristics and
significance of these building blocks, chemical words, remain unexplored. This
study aims to investigate the chemical vocabularies generated by popular
subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece,
and Unigram, and identify key chemical words associated with protein-ligand
binding. To this end, we build a language-inspired pipeline that treats high
affinity ligands of protein targets as documents and selects key chemical words
making up those ligands based on tf-idf weighting. Further, we conduct case
studies on a number of protein families to analyze the impact of key chemical
words on binding. Through our analysis, we find that these key chemical words
are specific to protein targets and correspond to known pharmacophores and
functional groups. Our findings will help shed light on the chemistry captured
by the chemical words, and by machine learning models for drug discovery at
large.
Comment: 16 pages, 11 figures, new computational analysis and extended case
studies
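The tf-idf selection step described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the token lists are hypothetical toy data rather than the output of BPE, WordPiece, or Unigram, the function name `tfidf_key_words` is made up, and the weighting (length-normalized term frequency times `log(N / df)`) is one common tf-idf variant that may differ from the one used in the study.

```python
import math
from collections import Counter


def tfidf_key_words(documents, top_k=2):
    """Rank the tokens of each document by tf-idf and return the top_k per document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each token appears.
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))
    ranked = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            tok: (cnt / total) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[name] = ordered[:top_k]
    return ranked


# Hypothetical toy data: each "document" pools the chemical words of the
# high-affinity ligands of one protein target.
documents = {
    "target_A": ["c1ccccc1", "C(=O)N", "S(=O)(=O)N", "S(=O)(=O)N"],
    "target_B": ["c1ccccc1", "C(=O)N", "C(=O)O", "C(=O)O"],
    "target_C": ["c1ccccc1", "N1CCNCC1", "N1CCNCC1", "C(=O)N"],
}
key_words = tfidf_key_words(documents, top_k=1)
```

Chemical words that occur in the ligands of every target (here the benzene ring and the amide) receive a weight of zero, so the top-ranked word for each target is the one that distinguishes its ligands from the rest.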
Neuroinflammation, Energy and Sphingolipid Metabolism Biomarkers Are Revealed by Metabolic Modeling of Autistic Brains
Autism spectrum disorders (ASD) are a heterogeneous group of neurodevelopmental disorders generally characterized by repetitive behaviors and difficulties in communication and social behavior. Despite this heterogeneity, several metabolic dysregulations are prevalent in individuals with ASD. This work aims to understand ASD brain metabolism by constructing an ASD-specific prefrontal cortex genome-scale metabolic model (GEM) using transcriptomics data to decipher novel neuroinflammatory biomarkers. The healthy and ASD-specific models are compared via uniform sampling to identify ASD-exclusive metabolic features. Notably, the results of our simulations are comparable to those reported in the literature, supporting the accuracy of our reconstructed ASD model. We found that several oxidative stress, mitochondrial dysfunction, and inflammatory markers are elevated in ASD. While oxidative phosphorylation fluxes were similar in the healthy and ASD-specific models, leaving that pathway nearly undisturbed, the tricarboxylic acid (TCA) fluxes indicated disruptions in the pathway. Similarly, the secretion of mitochondrial dysfunction markers such as pyruvate was found to be higher, as were the activities of oxidative stress marker enzymes such as alanine and aspartate aminotransferases (ALT and AST) and glutathione-disulfide reductase (GSR). We also detected abnormalities in sphingolipid metabolism, which has been implicated in many inflammatory and immune processes but whose relationship with ASD has not been thoroughly explored in the existing literature. We suggest that important sphingolipid metabolites, such as sphingosine-1-phosphate (S1P), ceramide, and glucosylceramide, may be promising biomarkers for the diagnosis of ASD and may provide an opportunity for early intervention in young children.
Identification of Therapeutic Targets for Medulloblastoma by Tissue-Specific Genome-Scale Metabolic Model
Medulloblastoma (MB), occurring in the cerebellum, is the most common childhood brain tumor. Because conventional methods reduce quality of life and endanger children through detrimental side effects, computer models are needed to imitate the characteristics of cancer cells and uncover effective therapeutic targets with minimal toxic effects on healthy cells. In this study, metabolic changes specific to MB were captured by a genome-scale metabolic brain model integrated with transcriptome data. To determine the roles of sphingolipid metabolism in proliferation and metastasis in the cancer cell, 79 reactions were incorporated into the MB model. The pathways employed by MB without a carbon source and the link between metastasis and the Warburg effect were examined in detail. To reveal therapeutic targets for MB, biomass-coupled reactions, the essential genes/gene products, and the antimetabolites, which might deplete the use of metabolites in cells by triggering competitive inhibition, were determined. As a result, interfering with the enzymes associated with fatty acid synthesis (FAs) and the mevalonate pathway in cholesterol synthesis, and suppressing cardiolipin production and tumor-supporting sphingolipid metabolites, might be effective therapeutic approaches for MB. Moreover, concurrently decreasing the activity of succinate synthesis and GABA-catalyzing enzymes might be a promising strategy for metastatic MB.
A network-based approach on elucidating the multi-faceted nature of chronological aging in S. cerevisiae.
BACKGROUND: Cellular mechanisms leading to aging, and therefore to increased susceptibility to age-related diseases, are a central topic of research since aging is the ultimate, yet poorly understood, mechanism of the fate of a cell. Studies with model organisms have been conducted to elucidate these mechanisms, and chronological aging of yeast has been extensively used as a model for oxidative stress and aging of postmitotic tissues in higher eukaryotes. METHODOLOGY/PRINCIPAL FINDINGS: The chronological aging network of yeast was reconstructed by integrating protein-protein interaction data with gene ontology terms. The reconstructed network was then statistically "tuned" based on the betweenness centrality values of the nodes to compensate for the computer-automated method. Both the originally reconstructed and tuned networks were subjected to topological and modular analyses. Finally, an ultimate "heart" network was obtained by pooling the step-specific key proteins, which resulted from the decomposition of the linear paths depicting several signaling routes in the tuned network. CONCLUSIONS/SIGNIFICANCE: The reconstructed networks are of scale-free and hierarchical nature, following a power law model with γ = 1.49. The results of modular and topological analyses verified that the tuning method was successful. The significantly enriched gene ontology terms of the modular analysis also confirmed that the multifactorial nature of chronological aging was captured by the tuned network. The interplay between various signaling pathways such as TOR, Akt/PKB and cAMP/Protein kinase A was summarized in the "heart" network originating from linear path analysis. The deletion of four genes, TCB3, SNA3, PST2 and YGR130C, was found to increase the chronological life span of yeast. The reconstructed networks can also give insight into the effect of other cellular machineries on chronological aging by targeting different signaling pathways in the linear path analysis, along with unraveling novel proteins that play a part in these pathways.
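The betweenness centrality measure mentioned in the abstract above counts, for each node, the shortest paths between other node pairs that pass through it. The following is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph. The node names `P1` to `P5` and the edges are hypothetical, and the abstract does not state how the centrality values were turned into a tuning criterion, so that step is not shown.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted adjacency-list graph."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:          # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


# Hypothetical toy interaction network (not data from the study).
toy_network = {
    "P1": ["P3"],
    "P2": ["P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3", "P5"],
    "P5": ["P4"],
}
centrality = betweenness_centrality(toy_network)
```

In this toy network, `P3` lies on the shortest paths of five node pairs and `P4` on three, while the leaf nodes score zero; nodes with high values are the bridges through which most signaling routes must pass.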
Circadian clock crosstalks with autism
Background: The mechanism underlying autism spectrum disorder (ASD) remains incompletely understood, but researchers have identified over a thousand genes involved in complex interactions within the brain, nervous, and immune systems, particularly during the mechanism of brain development. Various contributory environmental effects including circadian rhythm have also been studied in ASD. Thus, capturing the global picture of the ASD‐clock network in combined form is critical. Methods: We reconstructed the protein–protein interaction network of ASD and circadian rhythm to understand the connection between autism and the circadian clock. A graph theoretical study is undertaken to evaluate whether the network attributes are biologically realistic. The gene ontology enrichment analyses provide information about the most important biological processes. Results: This study takes a fresh look at metabolic mechanisms and the identification of potential key proteins/pathways (ribosome biogenesis, oxidative stress, insulin/IGF pathway, Wnt pathway, and mTOR pathway), as well as the effects of specific conditions (such as maternal stress or disruption of circadian rhythm) on the development of ASD due to environmental factors. Conclusion: Understanding the relationship between circadian rhythm and ASD provides insight into the involvement of these essential pathways in the pathogenesis/etiology of ASD, as well as potential early intervention options and chronotherapeutic strategies for treating or preventing the neurodevelopmental disorder.